best system
What are the best Systems? New Perspectives on NLP Benchmarking
In Machine Learning, a benchmark refers to an ensemble of datasets associated with one or multiple metrics, together with a way to aggregate the performances of different systems. Benchmarks are instrumental in (i) assessing the progress of new methods along different axes and (ii) selecting the best systems for practical use. This is particularly the case for NLP with the development of large pre-trained models (e.g. GPT, BERT) that are expected to generalize well on a variety of tasks. While the community has mainly focused on developing new datasets and metrics, there has been little interest in the aggregation procedure, which is often reduced to a simple average over various performance measures. However, this procedure can be problematic when the metrics are on different scales, which may lead to spurious conclusions. This paper proposes a new procedure to rank systems based on their performance across different tasks. Motivated by social choice theory, the final system ordering is obtained by aggregating the rankings induced by each task and is theoretically grounded. We conduct extensive numerical experiments (on over 270k scores) to assess the soundness of our approach on both synthetic and real scores (e.g. GLUE, EXTREM, SEVAL, TAC, FLICKR). In particular, we show that our method yields different conclusions on state-of-the-art systems than the mean-aggregation procedure while being both more reliable and robust.
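To illustrate why averaging raw scores and aggregating per-task rankings can disagree, here is a minimal sketch. It uses the Borda count, a classical rule from social choice theory; the abstract does not say which aggregation rule the paper uses, so this choice is an assumption. The system names and scores are hypothetical.

```python
from typing import Dict, List


def mean_ranking(scores: Dict[str, List[float]]) -> List[str]:
    """Order systems by the plain average of their task scores (higher is better)."""
    return sorted(scores, key=lambda s: sum(scores[s]) / len(scores[s]), reverse=True)


def borda_ranking(scores: Dict[str, List[float]]) -> List[str]:
    """Order systems by Borda count.

    On each task, a system earns one point for every system it strictly beats.
    Only the per-task ordering matters, so the scale of each metric is irrelevant.
    """
    systems = list(scores)
    n_tasks = len(next(iter(scores.values())))
    points = {s: 0 for s in systems}
    for t in range(n_tasks):
        for s in systems:
            points[s] += sum(scores[s][t] > scores[o][t] for o in systems if o != s)
    return sorted(systems, key=lambda s: points[s], reverse=True)


# Hypothetical scores: tasks 1-3 are accuracies in [0, 1],
# task 4 is a metric reported on a 0-100 scale.
scores = {
    "A": [0.90, 0.85, 0.88, 40.0],
    "B": [0.80, 0.80, 0.80, 55.0],
    "C": [0.70, 0.60, 0.75, 50.0],
}

print(mean_ranking(scores))   # the 0-100 task dominates the average
print(borda_ranking(scores))  # each task contributes equally
```

In this toy example the mean puts system A last because the single 0-100 metric swamps the three accuracy tasks, whereas the rank-based aggregation puts A first because it wins three tasks out of four.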
Reviews: Decoding with Value Networks for Neural Machine Translation
This paper addresses one of the limitations of NMT, the so-called exposure bias, that results from the fact that each word is chosen greedily. For this, the authors build on standard reinforcement learning techniques and try to predict, for each outgoing transition of a given state, the expected reward that will be achieved if the system takes this transition. The article is overall very clear and the proposed ideas quite appealing, even if many of the decisions seem quite ad hoc (e.g. More importantly, several implementation "details" are not specified. For instance, in Equation (6), the BLEU function is defined at the sentence level, while the actual BLEU metric is defined at the corpus level.
What are the best systems? New perspectives on NLP Benchmarking
Colombo, Pierre, Noiry, Nathan, Irurozki, Ekhine, Clemencon, Stephan
In Machine Learning, a benchmark refers to an ensemble of datasets associated with one or multiple metrics, together with a way to aggregate the performances of different systems. Benchmarks are instrumental in (i) assessing the progress of new methods along different axes and (ii) selecting the best systems for practical use. This is particularly the case for NLP with the development of large pre-trained models (e.g. GPT, BERT) that are expected to generalize well on a variety of tasks. While the community has mainly focused on developing new datasets and metrics, there has been little interest in the aggregation procedure, which is often reduced to a simple average over various performance measures. However, this procedure can be problematic when the metrics are on different scales, which may lead to spurious conclusions. This paper proposes a new procedure to rank systems based on their performance across different tasks. Motivated by social choice theory, the final system ordering is obtained by aggregating the rankings induced by each task and is theoretically grounded. We conduct extensive numerical experiments (on over 270k scores) to assess the soundness of our approach on both synthetic and real scores (e.g. GLUE, EXTREM, SEVAL, TAC, FLICKR). In particular, we show that our method yields different conclusions on state-of-the-art systems than the mean-aggregation procedure while being both more reliable and robust.
- Europe > France (0.04)
- North America > United States > Maryland > Montgomery County > Gaithersburg (0.04)
- Europe > Italy > Tuscany > Florence (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science (0.86)
Selecting the Best Optimizing System
We formulate selecting the best optimizing system (SBOS) problems and provide solutions for those problems. In an SBOS problem, a finite number of systems are contenders. Inside each system, a continuous decision variable affects the system's expected performance. An SBOS problem compares different systems based on their expected performances under their own optimally chosen decisions to select the best, without advance knowledge of either the expected performances of the systems or the optimizing decision inside each system. We design easy-to-implement algorithms that adaptively choose a system and a decision at which to evaluate the noisy system performance, sequentially eliminate inferior systems, and eventually recommend a system as the best after spending a user-specified budget. The proposed algorithms integrate the stochastic gradient descent method and the sequential elimination method to simultaneously exploit the structure inside each system and make comparisons across systems. For the proposed algorithms, we prove exponential rates of convergence to zero for the probability of false selection as the budget grows to infinity. We conduct three numerical examples that represent three practical cases of SBOS problems. Our proposed algorithms demonstrate consistent and stronger performances in terms of the probability of false selection over benchmark algorithms under a range of problem settings and sampling budgets.
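The combination of stochastic gradient steps inside each system with sequential elimination across systems can be sketched as follows. This is not the authors' algorithm, only a minimal illustration under assumptions: the budget is counted in SGD iterations, it is split evenly across rounds, one system is eliminated per round based on its average noisy performance in that round, and the step size decays as 1/sqrt(n). The function names `select_best_system` and `make_system` and the quadratic test systems are hypothetical.

```python
import random
from typing import Callable, List, Tuple

Noisy = Callable[[float, random.Random], float]


def select_best_system(noisy_perf: List[Noisy], noisy_grad: List[Noisy],
                       budget: int, step: float = 0.1, seed: int = 0) -> int:
    """Return the index of the system recommended as best (higher performance is better)."""
    rng = random.Random(seed)
    k = len(noisy_perf)
    alive = list(range(k))
    x = [0.0] * k        # current decision inside each system
    n_iter = [0] * k     # SGD iterations spent on each system so far
    rounds = k - 1       # assumption: eliminate exactly one system per round
    per_round = budget // max(rounds, 1)
    for _ in range(rounds):
        per_system = max(per_round // len(alive), 1)
        avg = {}
        for i in alive:
            total = 0.0
            for _ in range(per_system):
                n_iter[i] += 1
                # Stochastic gradient ascent step with a diminishing step size.
                x[i] += step / n_iter[i] ** 0.5 * noisy_grad[i](x[i], rng)
                total += noisy_perf[i](x[i], rng)
            avg[i] = total / per_system
        # Drop the system with the lowest average noisy performance this round.
        alive.remove(min(alive, key=lambda i: avg[i]))
    return alive[0]


def make_system(peak: float, opt: float, noise: float = 0.5) -> Tuple[Noisy, Noisy]:
    """Hypothetical system: expected performance peak - (x - opt)^2, observed with Gaussian noise."""
    perf = lambda x, rng: peak - (x - opt) ** 2 + rng.gauss(0.0, noise)
    grad = lambda x, rng: -2.0 * (x - opt) + rng.gauss(0.0, noise)
    return perf, grad


systems = [make_system(1.0, 2.0), make_system(3.0, -1.0), make_system(2.0, 0.5)]
best = select_best_system([p for p, _ in systems], [g for _, g in systems], budget=6000)
print(best)  # index of the recommended system
```

Each system is judged at its own improving decision rather than at a fixed one, which is the point of the SBOS formulation: system 1 has the highest optimal performance (3.0 at x = -1) even though it is not the best at the shared starting point x = 0.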
- North America > United States > California > Alameda County > Berkeley (0.14)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > United States > New Jersey > Hudson County > Hoboken (0.04)
Automated Question Answering System for Community-Based Questions
Pithyaachariyakul, Chanin (San Francisco State University) | Kulkarni, Anagha (San Francisco State University)
Answer (Y!A), and Quora, indicate that for certain information needs, users prefer receiving focused answers to their questions, rather than a list of URLs from search results. This trend has sparked a rich area of investigation, Automated Question Answering (QA), at the intersection of Information Retrieval (IR), Natural Language Processing (NLP), and Machine Learning (ML).
- North America > United States > California > San Francisco County > San Francisco (0.15)
- Europe > France > Auvergne-Rhône-Alpes > Lyon > Lyon (0.05)
- Asia > Middle East > Qatar > Ad-Dawhah > Doha (0.05)
MIT's new AI could eliminate video buffering woes
MIT researchers have found a way to improve video streaming by reducing buffering times and pixelation. A new AI developed at the university's Computer Science and Artificial Intelligence Laboratory uses machine learning to pick different algorithms depending on network conditions. In doing so, the AI, called Pensieve, has been shown to deliver a higher-quality streaming experience with less buffering than existing systems. Streaming sites use adaptive bitrate (ABR) algorithms to determine which resolution a video will play at. Instead of sending a video to your computer in one complete piece, a site breaks it up into smaller pieces and sends them sequentially.